Objective

The objective of this analytical report is to help companies identify good employees who are at risk of leaving. With this information, companies can allocate their finances and resources to the areas that help retain good employees.

Analysis Process

First, we will analyze and visualize the data to get a basic understanding of the data in hand (Human Resources Analytics by Ludovic Benistant from kaggle.com). After obtaining a basic understanding of the data, we will check the correlations among the variables to identify and interpret the key factors that drive employees to leave.

Second, we will segment all employees using cluster analysis to observe which clusters of employees have a higher probability of leaving.

Finally, we will bucket the employees (excluding those who have already left) across two dimensions, performance and risk of leaving, in order to predict and identify the employees companies generally wish to retain even at a higher cost: high-performing employees with a high risk of leaving (and perhaps also the low-performing employees with a low probability of leaving). This will help the company target its investment in human resources and reduce the risk and negative impact of losing high-performing employees.
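As a preview of this bucketing step, here is a minimal sketch in R. The cut-off value of 0.5 on each axis and the `risk_of_leaving` column (a hypothetical model output) are illustrative assumptions, not values taken from the analysis below.

```r
# Illustrative sketch: bucket employees on performance (last evaluation)
# and an assumed predicted risk of leaving. The 0.5 thresholds are placeholders.
employees <- data.frame(
  last_evaluation = c(0.9, 0.4, 0.8, 0.3),
  risk_of_leaving = c(0.8, 0.7, 0.2, 0.1)   # hypothetical model output
)
employees$bucket <- with(employees, ifelse(
  last_evaluation >= 0.5,
  ifelse(risk_of_leaving >= 0.5, "high perf / high risk", "high perf / low risk"),
  ifelse(risk_of_leaving >= 0.5, "low perf / high risk", "low perf / low risk")
))
employees$bucket
```

The "high perf / high risk" bucket is the group the company would invest in retaining first.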

1. Data check and Visualisation

1.1 Load and Explore the data

First, let’s load the data to use.

ProjectData <- read.csv("./data/HR_data.csv")  # load the raw HR data
ProjectData <- data.matrix(ProjectData)        # convert every column to numeric

Description of the data:

  1. Employee satisfaction level
  2. Last evaluation
  3. Number of projects
  4. Average monthly hours
  5. Time spent at the company
  6. Whether they have had a work accident
  7. Whether they have had a promotion in the last 5 years
  8. Department
  9. Salary (1=low, 2=medium, 3=high)
  10. Whether employee has left

This is what the first 10 observations (employees) look like.

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
satisfaction_level 0.38 0.80 0.11 0.72 0.37 0.41 0.10 0.92 0.89 0.42
last_evaluation 0.53 0.86 0.88 0.87 0.52 0.50 0.77 0.85 1.00 0.53
number_project 2.00 5.00 7.00 5.00 2.00 2.00 6.00 5.00 5.00 2.00
average_montly_hours 157.00 262.00 272.00 223.00 159.00 153.00 247.00 259.00 224.00 142.00
time_spend_company 3.00 6.00 4.00 5.00 3.00 3.00 4.00 5.00 5.00 3.00
Work_accident 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
left 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
promotion_last_5years 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
salary_level 1.00 2.00 2.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
sales 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
accounting 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hr 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
technical 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
support 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
management 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
IT 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
product_mng 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
marketing 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
RandD 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

The data we use here have the following descriptive statistics.

min 25 percent median mean 75 percent max std
satisfaction_level 0.09 0.44 0.64 0.61 0.82 1 0.25
last_evaluation 0.36 0.56 0.72 0.72 0.87 1 0.17
number_project 2.00 3.00 4.00 3.80 5.00 7 1.23
average_montly_hours 96.00 156.00 200.00 201.05 245.00 310 49.94
time_spend_company 2.00 3.00 3.00 3.50 4.00 10 1.46
Work_accident 0.00 0.00 0.00 0.14 0.00 1 0.35
left 0.00 0.00 0.00 0.24 0.00 1 0.43
promotion_last_5years 0.00 0.00 0.00 0.02 0.00 1 0.14
salary_level 1.00 1.00 2.00 1.59 2.00 3 0.64
sales 0.00 0.00 0.00 0.28 1.00 1 0.45
accounting 0.00 0.00 0.00 0.05 0.00 1 0.22
hr 0.00 0.00 0.00 0.05 0.00 1 0.22
technical 0.00 0.00 0.00 0.18 0.00 1 0.39
support 0.00 0.00 0.00 0.15 0.00 1 0.36
management 0.00 0.00 0.00 0.04 0.00 1 0.20
IT 0.00 0.00 0.00 0.08 0.00 1 0.27
product_mng 0.00 0.00 0.00 0.06 0.00 1 0.24
marketing 0.00 0.00 0.00 0.06 0.00 1 0.23
RandD 0.00 0.00 0.00 0.05 0.00 1 0.22
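The descriptive statistics above can be reproduced along these lines. This is a sketch: it is demonstrated on simulated columns so that it runs on its own; with the real data, the numeric matrix `ProjectData` loaded earlier would be passed in instead.

```r
# Sketch: compute, for each column, the summary statistics reported above.
set.seed(1)
example_data <- cbind(satisfaction_level = runif(100),
                      last_evaluation    = runif(100))  # stand-in for ProjectData
describe_columns <- function(m) {
  t(apply(m, 2, function(x) c(
    min          = min(x),
    `25 percent` = quantile(x, 0.25, names = FALSE),
    median       = median(x),
    mean         = mean(x),
    `75 percent` = quantile(x, 0.75, names = FALSE),
    max          = max(x),
    std          = sd(x)
  )))
}
round(describe_columns(example_data), 2)
```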

1.2 Scale the data

Here, we rescale the data so that the results are not driven by a few variables with relatively large values. Note that this is min-max normalization, scaling each variable to the [0, 1] range, rather than standardization to z-scores.

ProjectDataFactor_scaled = apply(ProjectData, 2, function(r) {
    res = (r - min(r))/(max(r) - min(r))  # min-max scale each column to [0, 1]
    res
})

Below are the summary statistics of the scaled dataset.

min 25 percent median mean 75 percent max std
satisfaction_level 0 0.38 0.60 0.57 0.80 1 0.27
last_evaluation 0 0.31 0.56 0.56 0.80 1 0.27
number_project 0 0.20 0.40 0.36 0.60 1 0.25
average_montly_hours 0 0.28 0.49 0.49 0.70 1 0.23
time_spend_company 0 0.12 0.12 0.19 0.25 1 0.18
Work_accident 0 0.00 0.00 0.14 0.00 1 0.35
left 0 0.00 0.00 0.24 0.00 1 0.43
promotion_last_5years 0 0.00 0.00 0.02 0.00 1 0.14
salary_level 0 0.00 0.50 0.30 0.50 1 0.32
sales 0 0.00 0.00 0.28 1.00 1 0.45
accounting 0 0.00 0.00 0.05 0.00 1 0.22
hr 0 0.00 0.00 0.05 0.00 1 0.22
technical 0 0.00 0.00 0.18 0.00 1 0.39
support 0 0.00 0.00 0.15 0.00 1 0.36
management 0 0.00 0.00 0.04 0.00 1 0.20
IT 0 0.00 0.00 0.08 0.00 1 0.27
product_mng 0 0.00 0.00 0.06 0.00 1 0.24
marketing 0 0.00 0.00 0.06 0.00 1 0.23
RandD 0 0.00 0.00 0.05 0.00 1 0.22

1.3 Check Correlations

The simplest way to take a first look at a dataset is to check the correlations. By doing this, we can easily see which factors have a high positive/negative correlation with employees leaving. Correlation is different from causality, so we cannot conclude that a highly correlated factor (independent variable) causes an employee to leave (dependent variable). Also, if some of the factors (independent variables) are highly correlated with each other, we could consider grouping those attributes together.
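A correlation matrix like the one below can be computed directly with base R's `cor` on the scaled matrix from Section 1.2 (a sketch; simulated stand-in columns are used here so the snippet is self-contained).

```r
# Sketch: round(cor(ProjectDataFactor_scaled), 2) produces a matrix like the
# one below. Simulated stand-in data keep this snippet self-contained.
set.seed(42)
demo <- cbind(satisfaction_level = runif(50),
              left               = rbinom(50, 1, 0.24))
round(cor(demo), 2)
# With the real data, correlations with attrition can be ranked with:
# sort(cor(ProjectDataFactor_scaled)[, "left"], decreasing = TRUE)
```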

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary_level sales accounting hr technical support management IT product_mng marketing RandD
satisfaction_level 1.00 0.11 -0.14 -0.02 -0.10 0.06 -0.39 0.03 0.05 0.00 -0.03 -0.01 -0.01 0.01 0.01 0.01 0.01 0.01 0.01
last_evaluation 0.11 1.00 0.35 0.34 0.13 -0.01 0.01 -0.01 -0.01 -0.02 0.00 -0.01 0.01 0.02 0.01 0.00 0.00 0.00 -0.01
number_project -0.14 0.35 1.00 0.42 0.20 0.00 0.02 -0.01 0.00 -0.01 0.00 -0.03 0.03 0.00 0.01 0.00 0.00 -0.02 0.01
average_montly_hours -0.02 0.34 0.42 1.00 0.13 -0.01 0.07 0.00 0.00 0.00 0.00 -0.01 0.01 0.00 0.00 0.01 -0.01 -0.01 0.00
time_spend_company -0.10 0.13 0.20 0.13 1.00 0.00 0.14 0.07 0.05 0.02 0.00 -0.02 -0.03 -0.03 0.12 -0.01 0.00 0.01 -0.02
Work_accident 0.06 -0.01 0.00 -0.01 0.00 1.00 -0.15 0.04 0.01 0.00 -0.01 -0.02 -0.01 0.01 0.01 -0.01 0.00 0.01 0.02
left -0.39 0.01 0.02 0.07 0.14 -0.15 1.00 -0.06 -0.16 0.01 0.02 0.03 0.02 0.01 -0.05 -0.01 -0.01 0.00 -0.05
promotion_last_5years 0.03 -0.01 -0.01 0.00 0.07 0.04 -0.06 1.00 0.10 0.01 0.00 0.00 -0.04 -0.04 0.13 -0.04 -0.04 0.05 0.02
salary_level 0.05 -0.01 0.00 0.00 0.05 0.01 -0.16 0.10 1.00 -0.04 0.01 0.00 -0.02 -0.03 0.16 -0.01 -0.01 0.01 0.00
sales 0.00 -0.02 -0.01 0.00 0.02 0.00 0.01 0.01 -0.04 1.00 -0.14 -0.14 -0.29 -0.26 -0.13 -0.18 -0.16 -0.15 -0.15
accounting -0.03 0.00 0.00 0.00 0.00 -0.01 0.02 0.00 0.01 -0.14 1.00 -0.05 -0.11 -0.10 -0.05 -0.07 -0.06 -0.06 -0.05
hr -0.01 -0.01 -0.03 -0.01 -0.02 -0.02 0.03 0.00 0.00 -0.14 -0.05 1.00 -0.11 -0.10 -0.05 -0.07 -0.06 -0.06 -0.05
technical -0.01 0.01 0.03 0.01 -0.03 -0.01 0.02 -0.04 -0.02 -0.29 -0.11 -0.11 1.00 -0.20 -0.10 -0.14 -0.12 -0.12 -0.11
support 0.01 0.02 0.00 0.00 -0.03 0.01 0.01 -0.04 -0.03 -0.26 -0.10 -0.10 -0.20 1.00 -0.09 -0.12 -0.11 -0.10 -0.10
management 0.01 0.01 0.01 0.00 0.12 0.01 -0.05 0.13 0.16 -0.13 -0.05 -0.05 -0.10 -0.09 1.00 -0.06 -0.05 -0.05 -0.05
IT 0.01 0.00 0.00 0.01 -0.01 -0.01 -0.01 -0.04 -0.01 -0.18 -0.07 -0.07 -0.14 -0.12 -0.06 1.00 -0.08 -0.07 -0.07
product_mng 0.01 0.00 0.00 -0.01 0.00 0.00 -0.01 -0.04 -0.01 -0.16 -0.06 -0.06 -0.12 -0.11 -0.05 -0.08 1.00 -0.06 -0.06
marketing 0.01 0.00 -0.02 -0.01 0.01 0.01 0.00 0.05 0.01 -0.15 -0.06 -0.06 -0.12 -0.10 -0.05 -0.07 -0.06 1.00 -0.06
RandD 0.01 -0.01 0.01 0.00 -0.02 0.02 -0.05 0.02 0.00 -0.15 -0.05 -0.05 -0.11 -0.10 -0.05 -0.07 -0.06 -0.06 1.00

The most significant variable is ‘satisfaction level’, which is strongly negatively correlated with employees leaving (-0.39); this is quite intuitive. Satisfaction level is also negatively correlated with time spent at the company and with the number of projects. This can be interpreted as ‘the longer the employee has stayed at the company, the lower the level of satisfaction’, which indicates that the company may be lacking in providing long-term goals or visions. Note, however, that the number of projects itself is only weakly correlated with leaving (0.02), as are average monthly hours (0.07). Since satisfaction drops with the number of projects but not notably with hours worked, we can infer that being involved in too many tasks, and the disorganization and distraction that come with it, lowers satisfaction more than long working hours alone.

2. Cluster Analysis and Segmentation

2.1 (1st try) Select segmentation variables and methods

We use all the variables except “Whether the employee has left.” We use Euclidean distance.

segmentation_attributes_used = c(1:6, 8:19)
profile_attributes_used = c(1:19)
numb_clusters_used = 5
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"

Here are the differences between the observations using the distance metric we selected:

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Obs.01 0.00
Obs.02 1.21 0.00
Obs.03 1.39 0.89 0.00
Obs.04 0.97 0.55 0.96 0.00
Obs.05 0.02 1.22 1.39 0.98 0.00
Obs.06 0.06 1.23 1.43 0.99 0.06 0.00
Obs.07 1.03 0.98 0.58 0.75 1.03 1.07 0.00
Obs.08 1.12 0.53 1.11 0.28 1.13 1.13 0.94 0.00
Obs.09 1.17 0.60 1.12 0.28 1.18 1.19 0.97 0.29 0.00
Obs.10 0.08 1.23 1.43 0.98 0.10 0.07 1.08 1.13 1.17 0
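A pairwise-distance table like the one above comes from base R's `dist` function. This is a sketch on simulated data; the real analysis would pass `ProjectDataFactor_scaled[, segmentation_attributes_used]` instead.

```r
# Sketch: Euclidean pairwise distances between the first few observations.
set.seed(7)
demo_scaled <- matrix(runif(10 * 8), nrow = 10)  # stand-in for the scaled data
d <- dist(demo_scaled, method = "euclidean")
round(as.matrix(d)[1:5, 1:5], 2)  # the lower triangle matches tables like the one above
```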

2.2 (1st try) Visualize Pair-wise Distances

We can see the histogram of, say, the first two variables:

or the histogram of all pairwise distances for the Euclidean distance:

2.3 (1st try) Number of Segments

Let’s use hierarchical clustering. It may be useful to see the dendrogram to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:

We can also plot the “distances” traveled before we need to merge any of the lower, smaller clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers; we show the first 20 here.
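The dendrogram, the branch-height plot, and the segment memberships can all be produced with base R's `hclust` and `cutree` (a sketch on simulated data; with the real data, `d` would be the Euclidean distance matrix computed above).

```r
# Sketch: hierarchical clustering with Ward's method, as configured above.
set.seed(7)
demo_scaled <- matrix(runif(40 * 5), nrow = 40)  # stand-in for the scaled data
d <- dist(demo_scaled, method = "euclidean")
fit <- hclust(d, method = "ward.D")
plot(fit)                          # dendrogram
heights <- rev(fit$height)         # "distances" traveled at each merge, largest first
plot(heights[1:20], type = "b")    # first 20 merge heights, as described in the text
membership <- cutree(fit, k = 4)   # e.g. the 4-segment solution
head(membership, 20)               # segment of the first 20 observations
```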

For now let’s consider the 4-segments solution. We can also see the segment each observation (employee, in this case) belongs to for the first 20 people:

Observation Number Cluster_Membership
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 2
20 1

2.4 (1st try) Profile and interpret the segments

Having decided how many clusters to use, we would like to get a better understanding of the employees in those clusters and interpret the segments.

Let’s see first how many observations we have in each segment, for the segments we selected above:

Segment 1 Segment 2 Segment 3 Segment 4
Number of Obs. 4040 6058 2692 2209

The average values of our data for the total population as well as within each employee segment are:

Population Segment 1 Segment 2 Segment 3 Segment 4
satisfaction_level 0.57 0.57 0.58 0.57 0.58
last_evaluation 0.56 0.55 0.56 0.56 0.57
number_project 0.36 0.36 0.36 0.38 0.36
average_montly_hours 0.49 0.49 0.49 0.50 0.49
time_spend_company 0.19 0.19 0.19 0.18 0.17
Work_accident 0.14 0.14 0.15 0.14 0.15
left 0.24 0.25 0.22 0.26 0.25
promotion_last_5years 0.02 0.00 0.05 0.00 0.00
salary_level 0.30 0.27 0.33 0.28 0.27
sales 0.28 1.00 0.02 0.00 0.00
accounting 0.05 0.00 0.13 0.00 0.00
hr 0.05 0.00 0.12 0.00 0.00
technical 0.18 0.00 0.00 1.00 0.00
support 0.15 0.00 0.00 0.00 1.00
management 0.04 0.00 0.10 0.00 0.00
IT 0.08 0.00 0.20 0.00 0.00
product_mng 0.06 0.00 0.15 0.00 0.00
marketing 0.06 0.00 0.14 0.00 0.00
RandD 0.05 0.00 0.13 0.00 0.00
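A profile table like the one above (population means alongside per-segment means) can be computed along these lines. This is a sketch on simulated data; in the real analysis, `membership` would come from `cutree` as in Section 2.3.

```r
# Sketch: segment sizes and per-segment averages.
set.seed(7)
demo <- data.frame(satisfaction_level = runif(30),
                   left               = rbinom(30, 1, 0.24))
membership <- sample(1:4, 30, replace = TRUE)  # stand-in for cutree output
table(membership)                              # number of observations per segment
population_means <- round(colMeans(demo), 2)   # "Population" column
segment_means <- round(
  aggregate(demo, by = list(Segment = membership), FUN = mean), 2)
population_means
segment_means
```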

The segment profiles appear to depend too heavily on department. Let’s run the analysis again, excluding the department information.

2.1 (2nd try) Select segmentation variables and methods

We use all the variables except “Whether the employee has left” and department. We use Euclidean distance.

segmentation_attributes_used = c(1:6, 8:9)
profile_attributes_used = c(1:19)
numb_clusters_used = 5
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"

Here are the differences between the observations using the distance metric we selected:

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Obs.01 0.00
Obs.02 1.21 0.00
Obs.03 1.39 0.89 0.00
Obs.04 0.97 0.55 0.96 0.00
Obs.05 0.02 1.22 1.39 0.98 0.00
Obs.06 0.06 1.23 1.43 0.99 0.06 0.00
Obs.07 1.03 0.98 0.58 0.75 1.03 1.07 0.00
Obs.08 1.12 0.53 1.11 0.28 1.13 1.13 0.94 0.00
Obs.09 1.17 0.60 1.12 0.28 1.18 1.19 0.97 0.29 0.00
Obs.10 0.08 1.23 1.43 0.98 0.10 0.07 1.08 1.13 1.17 0

2.2 (2nd try) Visualize Pair-wise Distances

Let us skip this subsection for the 2nd try.

2.3 (2nd try) Number of Segments

Let’s use hierarchical clustering. It may be useful to see the dendrogram to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:

We can also plot the “distances” traveled before we need to merge any of the lower, smaller clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers; we show the first 20 here.

For now let’s consider the 5-segments solution. We can also see the segment each observation (employee, in this case) belongs to for the first 20 people:

Observation Number Cluster_Membership
1 1
2 2
3 3
4 4
5 1
6 1
7 3
8 4
9 4
10 1
11 1
12 3
13 4
14 1
15 1
16 1
17 1
18 4
19 3
20 4

2.4 (2nd try) Profile and interpret the segments

Having decided how many clusters to use, we would like to get a better understanding of the employees in those clusters and interpret the segments.

Let’s see first how many observations we have in each segment, for the segments we selected above:

Segment 1 Segment 2 Segment 3 Segment 4 Segment 5
Number of Obs. 1816 4172 2728 4190 2093

The average values of our data for the total population as well as within each employee segment are:

Population Segment 1 Segment 2 Segment 3 Segment 4 Segment 5
satisfaction_level 0.57 0.37 0.71 0.28 0.70 0.61
last_evaluation 0.56 0.26 0.60 0.61 0.61 0.55
number_project 0.36 0.04 0.37 0.57 0.36 0.36
average_montly_hours 0.49 0.24 0.51 0.60 0.51 0.48
time_spend_company 0.19 0.12 0.14 0.36 0.16 0.19
Work_accident 0.14 0.00 0.00 0.03 0.00 1.00
left 0.24 0.77 0.10 0.35 0.15 0.08
promotion_last_5years 0.02 0.00 0.00 0.12 0.00 0.00
salary_level 0.30 0.25 0.60 0.33 0.00 0.30
sales 0.28 0.29 0.25 0.29 0.29 0.27
accounting 0.05 0.06 0.05 0.06 0.05 0.05
hr 0.05 0.07 0.05 0.05 0.05 0.04
technical 0.18 0.17 0.18 0.18 0.19 0.18
support 0.15 0.16 0.16 0.12 0.15 0.16
management 0.04 0.03 0.05 0.07 0.02 0.04
IT 0.08 0.07 0.09 0.07 0.09 0.08
product_mng 0.06 0.06 0.06 0.05 0.07 0.06
marketing 0.06 0.06 0.06 0.06 0.05 0.06
RandD 0.05 0.04 0.06 0.05 0.05 0.06

Everyone in Segment 5 had a work accident, which suggests the segmentation is dominated by that single variable. Let’s run the analysis again, excluding the work-accident information.

2.1 (3rd try) Select segmentation variables and methods

We use all the variables except “Whether the employee has left,” department, and work accident. We use Euclidean distance.

segmentation_attributes_used = c(1:5, 8:9)
profile_attributes_used = c(1:19)
numb_clusters_used = 4
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"

Here are the differences between the observations using the distance metric we selected:

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Obs.01 0.00
Obs.02 1.21 0.00
Obs.03 1.39 0.89 0.00
Obs.04 0.97 0.55 0.96 0.00
Obs.05 0.02 1.22 1.39 0.98 0.00
Obs.06 0.06 1.23 1.43 0.99 0.06 0.00
Obs.07 1.03 0.98 0.58 0.75 1.03 1.07 0.00
Obs.08 1.12 0.53 1.11 0.28 1.13 1.13 0.94 0.00
Obs.09 1.17 0.60 1.12 0.28 1.18 1.19 0.97 0.29 0.00
Obs.10 0.08 1.23 1.43 0.98 0.10 0.07 1.08 1.13 1.17 0

2.2 (3rd try) Visualize Pair-wise Distances

Let us skip this subsection for the 3rd try.

2.3 (3rd try) Number of Segments

Let’s use hierarchical clustering. It may be useful to see the dendrogram to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:

We can also plot the “distances” traveled before we need to merge any of the lower, smaller clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers; we show the first 20 here.

For now let’s consider the 4-segments solution. We can also see the segment each observation (employee, in this case) belongs to for the first 20 people:

Observation Number Cluster_Membership
1 1
2 2
3 3
4 4
5 1
6 1
7 3
8 4
9 4
10 1
11 1
12 3
13 4
14 1
15 1
16 1
17 1
18 4
19 2
20 4

2.4 (3rd try) Profile and interpret the segments

Having decided how many clusters to use, we would like to get a better understanding of the employees in those clusters and interpret the segments.

Let’s see first how many observations we have in each segment, for the segments we selected above:

Segment 1 Segment 2 Segment 3 Segment 4
Number of Obs. 1930 6017 2135 4917

The average values of our data for the total population as well as within each employee segment are:

Population Segment 1 Segment 2 Segment 3 Segment 4
satisfaction_level 0.57 0.37 0.70 0.12 0.70
last_evaluation 0.56 0.26 0.58 0.67 0.59
number_project 0.36 0.04 0.36 0.65 0.36
average_montly_hours 0.49 0.24 0.50 0.65 0.51
time_spend_company 0.19 0.12 0.20 0.29 0.15
Work_accident 0.14 0.08 0.17 0.12 0.16
left 0.24 0.76 0.08 0.47 0.13
promotion_last_5years 0.02 0.00 0.05 0.00 0.00
salary_level 0.30 0.26 0.57 0.26 0.00
sales 0.28 0.30 0.27 0.27 0.28
accounting 0.05 0.06 0.05 0.06 0.05
hr 0.05 0.07 0.04 0.05 0.05
technical 0.18 0.17 0.17 0.19 0.19
support 0.15 0.15 0.15 0.14 0.15
management 0.04 0.02 0.06 0.04 0.02
IT 0.08 0.07 0.08 0.08 0.08
product_mng 0.06 0.06 0.06 0.05 0.07
marketing 0.06 0.06 0.06 0.05 0.05
RandD 0.05 0.04 0.05 0.05 0.06

2.5 Implications

3. Drivers of Leaving Company

3.1 Classification tree

dependent_variable = 7               # column index of "left"
independent_variables = c(1:5, 8:9)  # exclude "left", work accident, and department

Probability_Threshold = 0.5          # classify as likely to leave above this probability

estimation_data_percent = 80         # use 80% of the data for estimation
validation_data_percent = 10         # 10% for validation (the remaining 10% for testing)

random_sampling = 0                  # 0 = deterministic split

# Tree (CART) complexity control cp (e.g. 0.001 to 0.02, depending on the
# data)
CART_cp = 0.01

# the minimum size of a segment for the analysis to be done only for that
# segment
min_segment = 100
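Under the parameters above, a CART model can be fit with the rpart package. The sketch below uses simulated stand-in data and an illustrative formula; the real analysis would use the estimation split of the HR dataset with the independent variables listed above.

```r
# Sketch: classification tree (CART) with cp and minimum-segment-size control.
library(rpart)
set.seed(1)
# Simulated stand-in data with the same kinds of variables as the HR dataset.
hr <- data.frame(
  satisfaction_level = runif(500),
  last_evaluation    = runif(500),
  number_project     = sample(2:7, 500, replace = TRUE),
  salary_level       = sample(1:3, 500, replace = TRUE)
)
# Illustrative outcome: leaving gets likelier as satisfaction drops.
hr$left <- factor(rbinom(500, 1, plogis(-2 * hr$satisfaction_level)))
fit <- rpart(left ~ ., data = hr, method = "class",
             control = rpart.control(cp = 0.01, minbucket = 100))
# Apply the probability threshold from the parameters above.
pred <- predict(fit, hr, type = "prob")[, "1"] > 0.5
```

`printcp(fit)` and `plot(fit)` can then be used to inspect the complexity table and the tree itself.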

4. Business Decisions

5. Future Work

First, companies can implement policies to control attrition by managing the variables that correlate strongly with employees leaving. For example, companies could work to increase employee satisfaction, focusing on the variables that drive satisfaction down. To encourage employees to stay longer, companies could share a clear long-term vision to help employees picture their future with the company. Companies could also adopt working policies that limit employees to a certain number of concurrent projects, allowing them to focus deeply on a few projects and thereby find more meaningful value and satisfaction.

Second, companies can use the prediction model to retain the high-performing employees with a high risk of leaving.

Conclusion